Using a Natural Cubic Spline in Place of the Quadratic Polynomial

For simplicity, we have introduced linear models for quantitative outcomes using a quadratic polynomial to represent the effect of a single quantitative predictor. In practice, we would typically use a natural cubic spline instead of a polynomial.

Using the EFFECT statement and non-positional syntax, it is straightforward to modify our code to use a natural cubic spline with 4 df in place of the quadratic polynomial.

Initial Fit and Diagnostic Plots

First, plot the residuals vs the predicted values.

LIBNAME NHANES "../../../BMI552_Datasets/NHANES_May2023/";
*LIBNAME NHANES "~/my_shared_file_links/u48718377/NHANES";
OPTIONS FMTSEARCH=(NHANES.formats_NHANES work) NODATE NONUMBER;
TITLE1; TITLE2;
ODS NOPROCTITLE;
ODS GRAPHICS / LOESSMAXOBS=20000;

DATA Analysis;
  SET NHANES.NHANES (RENAME=(Race1=Race));
RUN;

ODS EXCLUDE ALL;
PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=F;
  EFFECT AgeCS = Spline(Age / 
                NATURALCUBIC BASIS=TPF(NOINT) 
                KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
    
  MODEL TotChol = AgeCS / SELECTION=NONE;
    
  OUTPUT OUT=Fitted P=Pred R=Resid;
  STORE SplineModel;
RUN;
ODS EXCLUDE NONE;
PROC PRINT DATA=Fitted (OBS=10);
  VAR Age TotChol Pred Resid;
  FORMAT Pred 6.2 Resid 6.2;
RUN;

PROC SGPLOT DATA=Fitted;
  LOESS X=Pred Y=Resid / LINEATTRS=(COLOR="Red");   
RUN;
Obs Age TotChol Pred Resid
1 39 4.94 5.13 -0.19
2 37 6.36 5.07 1.29
3 34 7.24 4.97 2.27
4 40 6.28 5.16 1.12
5 52 3.1 5.35 -2.25
6 58 4.16 5.33 -1.17
7 61 4.86 5.29 -0.43
8 79 4.01 4.88 -0.87
9 22 3.59 4.58 -0.99
10 37 4.11 5.07 -0.96

The SGPlot Procedure 4.6 4.8 5.0 5.2 5.4 Pred -2.5 0.0 2.5 5.0 7.5 Resid Loess

Plot the square root of the absolute value of the residuals vs the fitted values.

DATA Fitted;
  SET Fitted;
    
  Sqrt_Abs_Resid = SQRT(ABS(Resid));
RUN;

PROC SGPLOT DATA=Fitted;
  LOESS X=pred Y=sqrt_abs_resid; 
  REG X=Pred Y=Sqrt_Abs_Resid / NOMARKERS LINEATTRS=(COLOR="green");
RUN;
The SGPlot Procedure 4.6 4.8 5.0 5.2 5.4 Pred 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Sqrt_Abs_Resid Regression Loess

Plot the residuals and square root of the absolute value of the residuals against each predictor variable.

PROC SGPLOT DATA=Fitted;
  LOESS X=Age Y=Resid / LINEATTRS=(COLOR="Red");   
RUN;

PROC SGPLOT DATA=Fitted;
  LOESS X=Age Y=Sqrt_Abs_Resid; 
  REG X=Age Y=Sqrt_Abs_Resid / NOMARKERS LINEATTRS=(COLOR="green");
RUN;
The SGPlot Procedure 20 30 40 50 60 70 80 Age (yrs) -2.5 0.0 2.5 5.0 7.5 Resid Loess

The SGPlot Procedure 20 30 40 50 60 70 80 Age (yrs) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Sqrt_Abs_Resid Regression Loess

Examine a normal scores plot for the residuals.

PROC UNIVARIATE DATA=Fitted NOPRINT;
  QQPLOT Resid / NORMAL(MU=EST SIGMA=EST);
RUN;
Q-Q plot for Resid -4 -2 0 2 4 Normal Quantiles -5.0 -2.5 0.0 2.5 5.0 7.5 10.0 Resid Mu=0, Sigma=1.0295 Normal Line Q-Q Plot for Resid

Formal Model Comparisons (AIC/BIC)

AIC or BIC can be used to compare different model specifications.

The following SAS code calculates AIC for models using natural cubic splines with 2-6 df and our original quadratic model.

ODS EXCLUDE ALL;
PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=Quadratic;
  EFFECT AgeCS = Polynomial(Age / DEGREE=2);

  MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;

DATA Quadratic;
  LENGTH Model $30;
  SET Quadratic;

  IF Label1="AIC";
  Model = "Quadratic";
  AIC = nValue1;
  KEEP Model AIC;
RUN;

PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=DF2;
  EFFECT AgeCS = Spline(Age /
                NATURALCUBIC BASIS=TPF(NOINT)
                KNOTMETHOD=PERCENTILELIST(10 50 90));

  MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;

DATA DF2;
  LENGTH Model $30;
  SET DF2;

  IF Label1="AIC";
  Model = "2 df";
  AIC = nValue1;
  KEEP Model AIC;
RUN;

PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=DF3;
  EFFECT AgeCS = Spline(Age /
                NATURALCUBIC BASIS=TPF(NOINT)
                KNOTMETHOD=PERCENTILELIST(5 35 65 95));

  MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;

DATA DF3;
  LENGTH Model $30;
  SET DF3;

  IF Label1="AIC";
  Model = "3 df";
  AIC = nValue1;
  KEEP Model AIC;
RUN;

PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=DF4;
  EFFECT AgeCS = Spline(Age /
                NATURALCUBIC BASIS=TPF(NOINT)
                KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));

  MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;

DATA DF4;
  LENGTH Model $30;
  SET DF4;

  IF Label1="AIC";
  Model = "4 df";
  AIC = nValue1;
  KEEP Model AIC;
RUN;

PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=DF5;
  EFFECT AgeCS = Spline(Age /
                NATURALCUBIC BASIS=TPF(NOINT)
                KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));

  MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;

DATA DF5;
  LENGTH Model $30;
  SET DF5;

  IF Label1="AIC";
  Model = "5 df";
  AIC = nValue1;
  KEEP Model AIC;
RUN;

PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=DF6;
  EFFECT AgeCS = Spline(Age /
                NATURALCUBIC BASIS=TPF(NOINT)
                KNOTMETHOD=PERCENTILELIST(2.5 18.33 34.17 50 65.83 81.67 97.5));

  MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;

DATA DF6;
  LENGTH Model $30;
  SET DF6;

  IF Label1="AIC";
  Model = "6 df";
  AIC = nValue1;
  KEEP Model AIC;
RUN;

DATA ModelComparisons;
  SET Quadratic DF2 DF3 DF4 DF5 DF6;
RUN;

PROC MEANS DATA=ModelComparisons MAX NOPRINT;
  VAR AIC;
  OUTPUT OUT=MinIC MIN= / AUTONAME;
RUN;

DATA ModelComparisons;
  IF _N_=1 THEN SET MinIC;
  SET ModelComparisons;

  AIC = AIC - AIC_Min;

  KEEP Model AIC;
RUN;
ODS EXCLUDE NONE;

TITLE "Model Comparisons (AIC)";
PROC PRINT DATA=ModelComparisons NOOBS;
  VAR Model AIC;

  FORMAT AIC 6.1;
RUN;
TITLE;

Model Comparisons (AIC)

Model AIC
Quadratic 3.6
2 df 1.2
3 df 0.0
4 df 1.5
5 df 3.4
6 df 4.9

Based on AIC, the optimal model is the natural cubic spline with 3 df.

The following SAS code calculates BIC for models using natural cubic splines with 2-6 df.

ODS EXCLUDE ALL;
PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=Quadratic;
  EFFECT AgeCS = Polynomial(Age / DEGREE=2);

  MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;

DATA Quadratic;
  LENGTH Model $30;
  SET Quadratic;

  IF Label1="SBC";
  Model = "Quadratic";
  BIC = nValue1;
  KEEP Model BIC;
RUN;

PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=DF2;
  EFFECT AgeCS = Spline(Age /
                NATURALCUBIC BASIS=TPF(NOINT)
                KNOTMETHOD=PERCENTILELIST(10 50 90));

  MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;

DATA DF2;
  LENGTH Model $30;
  SET DF2;

  IF Label1="SBC";
  Model = "2 df";
  BIC = nValue1;
  KEEP Model BIC;
RUN;

PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=DF3;
  EFFECT AgeCS = Spline(Age /
                NATURALCUBIC BASIS=TPF(NOINT)
                KNOTMETHOD=PERCENTILELIST(5 35 65 95));

  MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;

DATA DF3;
  LENGTH Model $30;
  SET DF3;

  IF Label1="SBC";
  Model = "3 df";
  BIC = nValue1;
  KEEP Model BIC;
RUN;

PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=DF4;
  EFFECT AgeCS = Spline(Age /
                NATURALCUBIC BASIS=TPF(NOINT)
                KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));

  MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;

DATA DF4;
  LENGTH Model $30;
  SET DF4;

  IF Label1="SBC";
  Model = "4 df";
  BIC = nValue1;
  KEEP Model BIC;
RUN;

PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=DF5;
  EFFECT AgeCS = Spline(Age /
                NATURALCUBIC BASIS=TPF(NOINT)
                KNOTMETHOD=PERCENTILELIST(5 23 41 59 77 95));

  MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;

DATA DF5;
  LENGTH Model $30;
  SET DF5;

  IF Label1="SBC";
  Model = "5 df";
  BIC = nValue1;
  KEEP Model BIC;
RUN;

PROC GLMSELECT DATA=Analysis;
  ODS OUTPUT FitStatistics=DF6;
  EFFECT AgeCS = Spline(Age /
                NATURALCUBIC BASIS=TPF(NOINT)
                KNOTMETHOD=PERCENTILELIST(2.5 18.33 34.17 50 65.83 81.67 97.5));

  MODEL TotChol = AgeCS / SELECTION=NONE;
RUN;

DATA DF6;
  LENGTH Model $30;
  SET DF6;

  IF Label1="SBC";
  Model = "6 df";
  BIC = nValue1;
  KEEP Model BIC;
RUN;

DATA ModelComparisons;
  SET Quadratic DF2 DF3 DF4 DF5 DF6;
RUN;

PROC MEANS DATA=ModelComparisons MAX NOPRINT;
  VAR BIC;
  OUTPUT OUT=MinIC MIN= / AUTONAME;
RUN;

DATA ModelComparisons;
  IF _N_=1 THEN SET MinIC;
  SET ModelComparisons;

  BIC = BIC - BIC_Min;

  KEEP Model BIC;
RUN;
ODS EXCLUDE NONE;

TITLE "Model Comparisons (BIC)";
PROC PRINT DATA=ModelComparisons NOOBS;
  VAR Model BIC;

  FORMAT BIC 6.1;
RUN;
TITLE;

Model Comparisons (BIC)

Model BIC
Quadratic 2.4
2 df 0.0
3 df 5.7
4 df 14.0
5 df 22.8
6 df 31.2

Based on BIC, the optimal model is the natural cubic spline with 2 df, which fits decisively better than the natural cubic spline with 4 df.

Note that we would not generally simplify our model based on these types of comparisons.

Inference

Given a satisfactory model, we can obtain compatability intervals for means and differences in means using ESTIMATE statements.

ODS EXCLUDE ALL;
PROC PLM RESTORE=SplineModel;
  ESTIMATE "Age 35" Intercept 1 AgeCS [1, 35] / CL ALPHA=0.03125;
  ESTIMATE "Age 45" Intercept 1 AgeCS [1, 45] / CL ALPHA=0.03125;
  ESTIMATE "Age 55" Intercept 1 AgeCS [1, 55] / CL ALPHA=0.03125;
  ESTIMATE "Age 65" Intercept 1 AgeCS [1, 65] / CL ALPHA=0.03125;
  ODS OUTPUT Estimates=E;
RUN;
ODS EXCLUDE NONE;

PROC PRINT DATA=E NOOBS LABEL;
  VAR Label Lower Estimate Upper;
  LABEL Label='00'x;
  FORMAT Estimate 6.3 Lower 6.3 Upper 6.3;
RUN;

ODS EXCLUDE ALL;
PROC PLM RESTORE=SplineModel;
  ESTIMATE "Age 45 v 35" AgeCS [1, 45] [-1, 35] / CL ALPHA=0.03125;
  ESTIMATE "Age 55 v 35" AgeCS [1, 55] [-1, 35] / CL ALPHA=0.03125;
  ESTIMATE "Age 55 v 45" AgeCS [1, 55] [-1, 45] / CL ALPHA=0.03125;
  ESTIMATE "Age 65 v 35" AgeCS [1, 65] [-1, 35] / CL ALPHA=0.03125;
  ESTIMATE "Age 65 v 45" AgeCS [1, 65] [-1, 45] / CL ALPHA=0.03125;
  ESTIMATE "Age 65 v 55" AgeCS [1, 65] [-1, 55] / CL ALPHA=0.03125;
  ODS OUTPUT Estimates=E;
RUN;
ODS EXCLUDE NONE;

PROC PRINT DATA=E NOOBS LABEL;
  VAR Label Lower Estimate Upper;
  LABEL Label='00'x;
  FORMAT Estimate 6.3 Lower 6.3 Upper 6.3;
RUN;
  Lower Estimate Upper
Age 35 4.947 5.004 5.061
Age 45 5.224 5.278 5.333
Age 55 5.303 5.349 5.395
Age 65 5.171 5.227 5.284

  Lower Estimate Upper
Age 45 v 35 0.199 0.274 0.350
Age 55 v 35 0.269 0.345 0.421
Age 55 v 45 0.014 0.071 0.128
Age 65 v 35 0.146 0.223 0.300
Age 65 v 45 -0.141 -0.051 0.039
Age 65 v 55 -0.164 -0.122 -0.080

The estimated mean serum cholesterol levels are \((4.947,5.061)\) mmol/L for 35 year olds, \((5.224,5.333)\) mmol/L for 45 year olds, \((0.199,0.350)\) mmol/L greater than 35 year olds; \((5.303,5.395)\) mmol/L for 55 year olds, \((0.269,0.421)\) mmol/L greater than 35 year olds and \((0.014,0.128)\) mmol/L greater than 45 year olds; and \((5.171,5.284)\) mmol/L for 65 year olds, \((0.146,0.300)\) mmol/L greater than 35 year olds, 0.141 mmol/L lower to 0.039 mmol/L greater than 45 year olds, and \((0.080,0.164)\) mmol/L lower than 55 year olds.

We can perform F-tests for the overall age effect and for linearity of the age effect.

ODS EXCLUDE ALL;
PROC GLMSELECT DATA=Analysis;
  EFFECT AgeCS = Spline(Age / 
          NATURALCUBIC BASIS=TPF(NOINT) 
          KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
    
  MODEL TotChol = AgeCS / SELECTION=NONE;
  ODS OUTPUT ANOVA=Full;
RUN;

DATA Full (KEEP=SS DF RENAME=(SS=SS_Full DF=DF_Full));
  SET Full;
  WHERE Source="Error";
RUN;

PROC GLMSELECT DATA=Analysis;
  EFFECT AgeCS = Spline(Age / 
          NATURALCUBIC BASIS=TPF(NOINT) 
          KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
    
  MODEL TotChol = Age / SELECTION=NONE;
  ODS OUTPUT ANOVA=Linear;
RUN;

DATA Linear (KEEP=Test DF SS);
  SET Linear;
  WHERE Source="Error";

  Test="Non-Linear Age";
RUN;

PROC GLMSELECT DATA=Analysis;
  EFFECT AgeCS = Spline(Age / 
          NATURALCUBIC BASIS=TPF(NOINT) 
          KNOTMETHOD=PERCENTILELIST(5 27.5 50 72.5 95));
    
  MODEL TotChol = / SELECTION=NONE;
  ODS OUTPUT ANOVA=Null;
RUN;

DATA Null (KEEP=Test DF SS);
  SET Null;
  WHERE Source="Error";

  Test="Overall Age";
RUN;

DATA FTests;
  LENGTH Test $50;

  IF _N_ = 1 THEN SET Full;
  SET Linear Null;

  SS = SS-SS_Full;
  DF = DF-DF_Full;
  MS = SS/DF;
  MS_Full = SS_Full/DF_Full;
  F = MS/MS_Full;
  ProbF = EXP(LOGSDF("F",F,DF,DF_Full));
  sValue = -LOG(ProbF)/LOG(2);
    
  KEEP Test DF MS F ProbF sValue;    
RUN;
ODS EXCLUDE NONE;

TITLE "F-Tests";
PROC PRINT DATA=FTests NOOBS;
  VAR Test DF MS F ProbF sValue;  

  FORMAT sValue 6.2;
  FORMAT ProbF PVALUE6.4;
RUN;
TITLE;

F-Tests

Test DF MS F ProbF sValue
Non-Linear Age 3 112.921 106.477 <.0001 221.67
Overall Age 4 115.441 108.853 <.0001 297.30

Prediction

The following graphic and table display prediction intervals (lighter red, LCL-UCL) for (individual) total cholesterol by age and confidence intervals (darker red, LCLM-UCLM) for the mean total cholesterol by age.

DATA Score;
  DO Age = 20 TO 80 BY 1;
    OUTPUT;
  END;
RUN;

ODS EXCLUDE ALL;
PROC PLM RESTORE=SplineModel;
  SCORE DATA=Score OUT=Score Predicted LCLM UCLM LCL UCL / ALPHA=0.03125;
RUN;
ODS EXCLUDE NONE;

DATA All;
  SET Analysis Score(IN=S);
    
  Score = S;
RUN;

PROC SGPLOT DATA=All NOAUTOLEGEND;
  SCATTER X=Age Y=TotChol / MARKERATTRS=(SYMBOL=CircleFilled COLOR=LightBlue);
  BAND X=Age Lower=LCL Upper=UCL / FILLATTRS=(COLOR=Red TRANSPARENCY=0.8);    
  BAND X=Age Lower=LCLM Upper=UCLM / FILLATTRS=(COLOR=Red TRANSPARENCY=0.6);
  SERIES X=Age Y=Predicted / LINEATTRS=(COLOR=Red);
RUN;

PROC PRINT DATA=All NOOBS;
  VAR Age Predicted LCL UCL LCLM UCLM;
  WHERE Score = 1;
RUN;
The SGPlot Procedure 20 30 40 50 60 70 80 Age (yrs) 2.5 5.0 7.5 10.0 12.5 Total Cholesterol (mmol/L)

Age Predicted LCL UCL LCLM UCLM
20 4.51090 2.29027 6.73152 4.41459 4.60720
21 4.54396 2.32376 6.76415 4.45813 4.62979
22 4.57702 2.35718 6.79685 4.50107 4.65296
23 4.61007 2.39053 6.82962 4.54314 4.67701
24 4.64312 2.42380 6.86245 4.58391 4.70234
25 4.67616 2.45699 6.89533 4.62288 4.72944
26 4.70917 2.49009 6.92826 4.65965 4.75869
27 4.74216 2.52311 6.96121 4.69412 4.79020
28 4.77511 2.55604 6.99417 4.72656 4.82365
29 4.80801 2.58891 7.02712 4.75761 4.85841
30 4.84086 2.62170 7.06003 4.78798 4.89374
31 4.87366 2.65443 7.09288 4.81833 4.92898
32 4.90638 2.68711 7.12565 4.84915 4.96362
33 4.93904 2.71974 7.15833 4.88082 4.99726
34 4.97161 2.75232 7.19090 4.91357 5.02965
35 5.00406 2.78481 7.22332 4.94742 5.06071
36 5.03622 2.81702 7.25541 4.98187 5.09056
37 5.06787 2.84874 7.28700 5.01625 5.11949
38 5.09881 2.87974 7.31788 5.04982 5.14780
39 5.12884 2.90981 7.34787 5.08184 5.17583
40 5.15774 2.93872 7.37675 5.11166 5.20381
41 5.18530 2.96628 7.40432 5.13890 5.23171
42 5.21133 2.99228 7.43038 5.16346 5.25920
43 5.23561 3.01651 7.45471 5.18554 5.28568
44 5.25794 3.03878 7.47709 5.20546 5.31042
45 5.27810 3.05890 7.49730 5.22355 5.33265
46 5.29590 3.07666 7.51513 5.24010 5.35169
47 5.31116 3.09192 7.53040 5.25528 5.36704
48 5.32392 3.10471 7.54313 5.26903 5.37881
49 5.33425 3.11508 7.55341 5.28116 5.38733
50 5.34221 3.12309 7.56132 5.29140 5.39302
51 5.34788 3.12882 7.56694 5.29942 5.39634
52 5.35133 3.13231 7.57035 5.30487 5.39779
53 5.35263 3.13364 7.57163 5.30743 5.39784
54 5.35186 3.13287 7.57085 5.30689 5.39682
55 5.34908 3.13007 7.56808 5.30329 5.39487
56 5.34436 3.12532 7.56340 5.29685 5.39187
57 5.33778 3.11869 7.55687 5.28800 5.38757
58 5.32941 3.11026 7.54856 5.27722 5.38160
59 5.31932 3.10012 7.53851 5.26498 5.37365
60 5.30759 3.08835 7.52683 5.25157 5.36360
61 5.29430 3.07503 7.51357 5.23714 5.35145
62 5.27953 3.06025 7.49882 5.22181 5.33726
63 5.26338 3.04409 7.48266 5.20562 5.32113
64 5.24591 3.02663 7.46518 5.18860 5.30321
65 5.22721 3.00796 7.44646 5.17074 5.28367
66 5.20736 2.98814 7.42659 5.15201 5.26272
67 5.18645 2.96726 7.40564 5.13232 5.24058
68 5.16456 2.94539 7.38372 5.11159 5.21752
69 5.14176 2.92262 7.36090 5.08969 5.19383
70 5.11814 2.89901 7.33728 5.06648 5.16980
71 5.09379 2.87465 7.31293 5.04183 5.14574
72 5.06878 2.84961 7.28795 5.01566 5.12190
73 5.04320 2.82398 7.26242 4.98792 5.09847
74 5.01712 2.79782 7.23642 4.95868 5.07556
75 4.99064 2.77122 7.21005 4.92807 5.05321
76 4.96382 2.74426 7.18338 4.89626 5.03139
77 4.93676 2.71702 7.15651 4.86346 5.01007
78 4.90954 2.68958 7.12950 4.82990 4.98918
79 4.88224 2.66202 7.10245 4.79578 4.96869
80 4.85492 2.63441 7.07543 4.76130 4.94854